
feat(optimize): MCP tool coverage detector with cache-aware costing #223

Merged
iamtoruk merged 4 commits into getagentseal:main from ozymandiashh:feat/mcp-tool-coverage
May 5, 2026

Conversation

@ozymandiashh
Contributor

Summary

Closes #2.

Adds a per-tool optimizer finding for MCP servers whose schema is loaded on every turn but rarely invoked. Builds on the existing server-level detectUnusedMcp (zero invocations) by reporting partial-use cases like "loaded 54 tools, called 0" or "loaded 26 tools, called 2 (8% coverage)".

Smoke-tested on a real account: 7 servers flagged across 93 sessions (office-word-mcp 0/54, notebooklm-mcp 0/38, office-ppt-mcp 0/37, excel-mcp-server 0/25, github-mcp-server 2/26, peekaboo 3/22, plus claude_ai_Asana).

Inventory source

Claude Code's JSONL writes attachment.deferred_tools_delta entries whose addedNames array lists the exact tools available at that turn — including every fully-qualified mcp__<server>__<tool> name. We union across all delta entries in a session (not just the first) because tool availability can change mid-session when MCP config reloads or a subagent inherits a different tool set.

Names that don't match the mcp__<server>__<tool> shape with both segments non-empty are rejected at extraction so downstream split('__') consumers can't be poisoned.
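
The extraction rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `extractMcpInventory`: the helper names `parseMcpName` and `unionInventory` and the `DeltaEntry` shape are assumptions made for the example, and only the `addedNames` field mentioned above is modelled.

```typescript
// Hypothetical entry shape: only the field this sketch reads is modelled.
interface DeltaEntry {
  addedNames?: unknown
}

// Split a fully-qualified name into server and tool. Returns null unless the
// name looks like mcp__<server>__<tool> with both segments non-empty.
function parseMcpName(name: string): { server: string; tool: string } | null {
  const prefix = 'mcp__'
  if (!name.startsWith(prefix)) return null
  const rest = name.slice(prefix.length)
  const sep = rest.indexOf('__')
  if (sep <= 0) return null // no separator at all, or an empty server segment
  const tool = rest.slice(sep + 2)
  if (tool.length === 0) return null // empty tool segment
  return { server: rest.slice(0, sep), tool }
}

// Union the valid names across every delta entry in a session, not just the
// first one, and return them as a sorted unique list.
function unionInventory(entries: DeltaEntry[]): string[] {
  const seen = new Set<string>()
  for (const entry of entries) {
    if (!Array.isArray(entry.addedNames)) continue // tolerate malformed entries
    for (const name of entry.addedNames) {
      if (typeof name === 'string' && parseMcpName(name) !== null) seen.add(name)
    }
  }
  return Array.from(seen).sort()
}
```

Splitting on the first `__` after the prefix is a choice made for this sketch; it keeps any further `__` inside the tool segment rather than rejecting the name.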

Token-savings estimation

MCP tool schemas live in the cached prefix of the system prompt:

  • Each cache-creation event (rebuilds happen every ~5 minutes of inactivity) pays the full input price.
  • Subsequent turns pay the cache-read discount (~10% of input).
  • Each call's contribution is capped at its observed cacheCreationInputTokens / cacheReadInputTokens so we never claim more MCP overhead than the call's own cache buckets could contain.

When multiple servers are flagged, costing is a single combined pass: the per-call cap applies to the total unused-schema budget across all flagged servers, not per server. Two flagged servers can't independently claim the same call's cache bucket and overstate tokensSaved.
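
A minimal sketch of the combined-cap idea, under stated assumptions: the per-server unused-schema token estimates are taken as given (how they are derived is not covered here), and `CallUsage`, `CostRates`, and `estimateEffectiveTokens` are illustrative names rather than the PR's `estimateMcpSchemaCost`. The multipliers are parameters because the description above prices cache writes at the full input rate, while a follow-up commit on this PR switches to 1.25x.

```typescript
// Hypothetical per-call usage record; field names follow the description above.
interface CallUsage {
  cacheCreationInputTokens: number
  cacheReadInputTokens: number
}

interface CostRates {
  writeMultiplier: number // price of a cache write relative to plain input
  readMultiplier: number  // price of a cache read relative to plain input
}

function estimateEffectiveTokens(
  calls: CallUsage[],
  unusedSchemaTokensByServer: Record<string, number>,
  rates: CostRates = { writeMultiplier: 1.25, readMultiplier: 0.1 },
): number {
  // One shared budget across all flagged servers, so two servers cannot both
  // claim the same call's cache bucket.
  let budget = 0
  for (const server of Object.keys(unusedSchemaTokensByServer)) {
    budget += unusedSchemaTokensByServer[server] ?? 0
  }

  let effective = 0
  for (const call of calls) {
    // Both buckets are considered for every call: a call may write and read
    // cache at the same time, so there is no early `continue` after the write.
    effective += Math.min(budget, call.cacheCreationInputTokens) * rates.writeMultiplier
    effective += Math.min(budget, call.cacheReadInputTokens) * rates.readMultiplier
  }
  return effective
}
```

Applying the cap per server instead would let each server count up to the call's full bucket, which is exactly the overstatement the single pass avoids.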

Correctness invariants

  • A session counts toward loadedSessions (and toward the cost estimate) only if its observed inventory included the server. Pure invocation-only sessions, where the server appears in mcpBreakdown or call.mcpTools without any matching deferred_tools_delta, do not satisfy the >= 2 sessions threshold on their own.
  • Coverage is computed against the inventory only: invocations of names not present in any observed inventory (older config, hallucinated tool, typo) do not inflate toolsInvoked and cannot drive unusedCount negative. toolsInvoked is derived as inventory.size - unusedTools.length to keep both numbers consistent.
  • detectUnusedMcp and the new detector are explicitly disjoint: detectUnusedMcp skips servers that the coverage detector will actually report (i.e. those clearing its thresholds), not every server that happens to be in any inventory. A small inventoried-but-uninvoked server below the coverage thresholds still gets flagged as "configured but never called."
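
The first two invariants can be illustrated with a small aggregation sketch. It is not the PR's `aggregateMcpCoverage`: `SessionLike`, `ServerCoverage`, and `aggregateCoverageSketch` are hypothetical names, `invokedMcpTools` stands in for whatever the real session records expose, and counting invocations from every session (while still measuring them only against the inventory) is an assumption of this example.

```typescript
// Hypothetical, simplified session record.
interface SessionLike {
  mcpInventory?: string[]   // validated mcp__<server>__<tool> names, if observed
  invokedMcpTools: string[] // fully-qualified names of tools that were called
}

interface ServerCoverage {
  loadedSessions: number
  toolsAvailable: number
  toolsInvoked: number
  unusedTools: string[]
}

function aggregateCoverageSketch(sessions: SessionLike[]): Map<string, ServerCoverage> {
  const inventory = new Map<string, Set<string>>()
  const loaded = new Map<string, number>()
  const invokedNames = new Set<string>()

  for (const session of sessions) {
    const serversSeen = new Set<string>()
    for (const name of session.mcpInventory ?? []) {
      const server = name.split('__')[1]
      if (!server) continue
      serversSeen.add(server)
      let tools = inventory.get(server)
      if (!tools) {
        tools = new Set<string>()
        inventory.set(server, tools)
      }
      tools.add(name)
    }
    // Only sessions whose inventory included the server count as loaded.
    serversSeen.forEach((server) => loaded.set(server, (loaded.get(server) ?? 0) + 1))
    for (const name of session.invokedMcpTools) invokedNames.add(name)
  }

  const result = new Map<string, ServerCoverage>()
  inventory.forEach((tools, server) => {
    const unusedTools = Array.from(tools).filter((t) => !invokedNames.has(t)).sort()
    result.set(server, {
      loadedSessions: loaded.get(server) ?? 0,
      toolsAvailable: tools.size,
      // Derived from the inventory, so foreign invocations cannot inflate it.
      toolsInvoked: tools.size - unusedTools.length,
      unusedTools,
    })
  })
  return result
}
```

Because the result is keyed by inventoried servers only, a server that shows up purely through invocations never produces a coverage entry in this sketch.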

Thresholds

  • > 10 tools available (small servers are noise)
  • < 20% coverage
  • >= 2 sessions with observed inventory
  • High impact when total effective tokens >= 200_000 or >= 3 servers flagged
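
Read as code, the gates above might look like the sketch below. The numeric limits come from the list; `CoverageStats`, `clearsCoverageThresholds`, and `isHighImpact` are hypothetical names for this example, and what the detector reports when a finding is not high impact is not stated here, so the sketch only returns a boolean.

```typescript
// Hypothetical per-server stats consumed by the gates.
interface CoverageStats {
  toolsAvailable: number
  toolsInvoked: number
  loadedSessions: number // sessions with observed inventory for this server
}

// A server is reported only when all three gates pass.
function clearsCoverageThresholds(s: CoverageStats): boolean {
  if (s.toolsAvailable <= 10) return false // small servers are noise
  const coverage = s.toolsInvoked / s.toolsAvailable
  return coverage < 0.2 && s.loadedSessions >= 2
}

// High impact is decided across all flagged servers together.
function isHighImpact(totalEffectiveTokens: number, flaggedServers: number): boolean {
  return totalEffectiveTokens >= 200_000 || flaggedServers >= 3
}
```

For instance, a server at 2/26 tools over several inventoried sessions clears the gates, while a 0/8 server does not and is left to the "configured but never called" detector.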

Changes

  • src/types.ts: optional mcpInventory: string[] on SessionSummary. Provider-agnostic field; currently populated only by the Claude parser.
  • src/parser.ts: extractMcpInventory walks all entries, validates fully-qualified names, returns sorted unique list. buildSessionSummary passes it through; the field is omitted when empty so JSON exports stay clean.
  • src/optimize.ts: aggregateMcpCoverage, estimateMcpSchemaCost (single- and multi-server signatures), detectMcpToolCoverage. Wired into scanAndDetect. detectUnusedMcp updated to be disjoint with the new detector.
  • tests/mcp-coverage.test.ts: 23 cases covering aggregation, costing, combined-cap behaviour, threshold gates, invocation-only-session filtering, foreign-tool invocations, cache rebuild events, write+read on the same call, multi-server pluralisation, backward-compat single-server signature.
  • tests/parser-mcp-inventory.test.ts: 12 cases for the JSONL extractor including malformed name rejection (mcp__server, mcp__server__, mcp____tool) and tolerant attachment parsing.
  • CHANGELOG.md: entry under Unreleased / Added (CLI).

Scope notes

  • Claude-only. deferred_tools_delta is Claude Code-specific. The field is provider-agnostic on SessionSummary so other parsers can populate it later, but no other provider exposes the same telemetry today.
  • No public API change beyond the new mcpInventory optional field. All existing schemas, exports, and CLI flags are unaffected.
  • No dashboard panel in this PR. The optimizer is the lowest-friction surfacing path, which fits the existing waste-finding model. Open to following up with a panel if you'd like.

Test plan

npx tsc --noEmit       # 0 errors
npx vitest run         # 34 files, 462 passed (was 427 baseline + 35 new)
npm run build          # success
node dist/cli.js optimize -p week
# -> "7 MCP servers with low tool coverage" finding (High)
# -> existing "configured but never used" still flags servers below the
#    coverage thresholds (e.g. `oura` in my data)

Reviews considered

Design and implementation went through three rounds of code review (Codex GPT-5.5 high, Gemini 3.1 Pro Preview, an internal Sonnet reviewer) before this PR. Concrete findings addressed end-to-end:

  • Duplicate findings between the legacy and new detector
  • loadedSessions counted from invocation-only sessions, diluting the threshold
  • toolsInvoked counting tools not present in inventory
  • continue after cacheCreationInputTokens skipping the same call's cacheReadInputTokens
  • extractMcpInventory accepting malformed names
  • Cache rebuilds (multiple cacheCreation events per session)
  • Cumulative tokensSaved over-count when multiple servers flagged share a cache bucket
  • Inventory-vs-breakdown semantic mismatch between aggregator and cost estimator
  • Blind spot in detectUnusedMcp for inventoried-but-uninvoked small servers

Adds a per-tool optimizer finding for MCP servers whose schema is loaded
on every turn but rarely invoked. Builds on the existing server-level
`detectUnusedMcp` (zero invocations) by reporting partial-use cases:
"loaded 54 tools, called 0" or "loaded 26 tools, called 2 (8% coverage)".

Inventory comes from Claude Code's JSONL `attachment.deferred_tools_delta`
entries: `addedNames` lists the exact tools available at that turn,
including every fully-qualified `mcp__<server>__<tool>` name. We union
across all delta entries in a session (not just the first) because tool
availability can change mid-session when the user reloads MCP config or
a subagent inherits a different tool set. Names that don't match the
`mcp__<server>__<tool>` shape with both segments non-empty are rejected
at extraction so downstream `split('__')` consumers can't be poisoned.

Token-savings estimates are cache-aware. MCP tool schemas live in the
cached prefix of the system prompt: a session pays the full input price
on each cache-creation turn (rebuilds happen every ~5 minutes of
inactivity) and the cache-read discount on subsequent turns. Each call's
contribution is capped at its observed `cacheCreationInputTokens` /
`cacheReadInputTokens` so we never claim more MCP overhead than the
call's own cache buckets could contain.

When multiple servers are flagged, costing happens in a single combined
pass: the per-call cap applies to the total unused-schema budget across
all flagged servers, not per server. Two flagged servers cannot both
independently claim the same call's cache bucket, which would otherwise
overstate `tokensSaved` and misclassify findings as high impact.

A session counts toward `loadedSessions` (and toward the cost estimate)
only if its observed inventory included the server. Pure invocation-only
sessions, where the server appears in `mcpBreakdown` or `call.mcpTools`
without any matching `deferred_tools_delta`, do not satisfy the
`>= 2 sessions` threshold on their own. The same invariant applies in
`estimateMcpSchemaCost` so the two passes agree.

Coverage is computed against the inventory only: invocations of names
not present in any observed inventory (older config, hallucinated tool,
typo) do not inflate `toolsInvoked` and cannot drive `unusedCount`
negative. `toolsInvoked` is derived as `inventory.size - unusedTools.length`
to keep both numbers consistent.

`detectUnusedMcp` and the new detector are explicitly disjoint:
`detectUnusedMcp` skips servers that the coverage detector will report,
not every server that happens to be in any inventory, so a small
inventoried-but-uninvoked server below the coverage thresholds still
gets flagged as "configured but never called."

Thresholds for the coverage finding:
- > 10 tools available (small servers are noise)
- < 20% coverage
- >= 2 sessions with observed inventory
- High impact when total effective tokens >= 200_000 or >= 3 servers flagged

Smoke-tested on a real account: 7 servers flagged across 93 sessions
(`office-word-mcp` 0/54, `notebooklm-mcp` 0/38, `office-ppt-mcp` 0/37,
`excel-mcp-server` 0/25, `github-mcp-server` 2/26, `peekaboo` 3/22, plus
`claude_ai_Asana`). Combined-cap costing keeps `tokensSaved` honest.

Changes:
- src/types.ts: optional `mcpInventory: string[]` on `SessionSummary`.
  Provider-agnostic field; currently populated only by the Claude parser.
- src/parser.ts: `extractMcpInventory` walks all entries, validates
  fully-qualified names, returns sorted unique list. `buildSessionSummary`
  passes it through; field is omitted when empty so JSON exports stay
  clean.
- src/optimize.ts: `aggregateMcpCoverage`, `estimateMcpSchemaCost`
  (single- and multi-server signatures), `detectMcpToolCoverage`. Wired
  into `scanAndDetect`. `detectUnusedMcp` updated to be disjoint with the
  new detector.
- tests/mcp-coverage.test.ts: 23 cases covering aggregation, costing,
  combined-cap behaviour, threshold gates, invocation-only-session
  filtering, foreign-tool invocations, cache rebuild events, write+read
  on the same call, multi-server pluralisation.
- tests/parser-mcp-inventory.test.ts: 12 cases for the JSONL extractor
  including malformed name rejection and tolerant attachment parsing.
- CHANGELOG.md: entry under Unreleased / Added (CLI).

Closes getagentseal#2
@iamtoruk
Member

iamtoruk commented May 5, 2026

Solid work. Clean separation between extractMcpInventory (parser), aggregateMcpCoverage (aggregation), estimateMcpSchemaCost (costing), and detectMcpToolCoverage (finding emission). Each piece is independently testable and tested.

35 new tests covering edge cases well: malformed names, invocation-only sessions, foreign tools, cache rebuilds, multi-server cap, threshold gates, pluralization.

Comments are dense but justified here. The domain (cache pricing, inventory semantics) is genuinely complex and the invariants are non-obvious.

Two things to address:

  1. Double aggregation: aggregateMcpCoverage is called in both detectMcpToolCoverage and the updated detectUnusedMcp. Should compute once and pass the result.

  2. The estimateMcpSchemaCost backward-compat overload accepts number | Record<string, number> as first arg and string | string[] as third. The single-server path does { [serverOrServers as string]: unusedToolCounts } which is safe only because the caller ensures the types match. TypeScript function overloads would be cleaner than runtime type checks.

Otherwise this is one of the highest quality external PRs on this repo. Nice iteration.

@ozymandiashh
Contributor Author

ozymandiashh commented May 5, 2026

Thanks for the review. I addressed both cleanup points in e46b20b:

  • scanAndDetect now computes MCP coverage once and passes it into both MCP detectors, so we avoid the duplicate aggregateMcpCoverage(projects) pass.
  • estimateMcpSchemaCost now exposes typed overloads for the single-server and multi-server call shapes, with guarded normalization inside the implementation instead of relying on unsafe casts.
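
For readers unfamiliar with the pattern, typed overloads with guarded normalization could look roughly like this. It is a sketch, not the code in e46b20b: `normalizeUnusedCounts` is a hypothetical helper that only covers the first and third arguments discussed in the review and omits everything else `estimateMcpSchemaCost` takes.

```typescript
// Overload signatures: callers see exactly two valid shapes.
function normalizeUnusedCounts(unusedToolCount: number, server: string): Record<string, number>
function normalizeUnusedCounts(unusedToolCounts: Record<string, number>, servers: string[]): Record<string, number>
// Implementation signature: not callable directly from outside.
function normalizeUnusedCounts(
  unused: number | Record<string, number>,
  serverOrServers: string | string[],
): Record<string, number> {
  if (typeof unused === 'number') {
    // Guard instead of casting: the single-count form needs a single server.
    if (typeof serverOrServers !== 'string') {
      throw new TypeError('single-server form expects a server name')
    }
    return { [serverOrServers]: unused }
  }
  if (!Array.isArray(serverOrServers)) {
    throw new TypeError('multi-server form expects an array of server names')
  }
  const normalized: Record<string, number> = {}
  for (const server of serverOrServers) {
    normalized[server] = unused[server] ?? 0
  }
  return normalized
}
```

With the overloads in place, a mismatched call such as a count paired with an array of servers is rejected at compile time, and the runtime guards only matter for untyped callers.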

Validated locally with:

  • npx tsc --noEmit
  • npx vitest run tests/mcp-coverage.test.ts
  • npx vitest run
  • npm run build

Please take another look when you have a chance.
Thanks for the kind words. I'm thinking about adding a confidence meter (how sure code burn is about the usage costs) and maybe also a recommendation on whether you should switch to API or subscription. What do you think? Or would you rather I tackle more of the remaining issues first?

iamtoruk added 2 commits May 4, 2026 20:11
- Use 1.25x multiplier for cache-write tokens to match Anthropic's
  actual pricing (was incorrectly using 1x)
- Shell-quote server names in `claude mcp remove` fix text to prevent
  issues with unusual server names
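
As an illustration of the quoting concern, here is one common way to single-quote a value for a POSIX shell. It is a sketch under that assumption; `shellQuote` and `removeCommandText` are hypothetical names and may not match how the commit builds its fix text.

```typescript
// Wrap a value in single quotes for a POSIX shell. An embedded single quote
// is emitted as '\'' (close quote, escaped quote, reopen quote).
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, "'\\''")}'`
}

// Build the suggested fix text for a given server name.
function removeCommandText(serverName: string): string {
  return `claude mcp remove ${shellQuote(serverName)}`
}
```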
@iamtoruk iamtoruk merged commit 4ac8e8d into getagentseal:main May 5, 2026
3 checks passed
Development

Successfully merging this pull request may close these issues.

Feature: show token overhead from unused MCP tool definitions

2 participants